Student Information

Name: Sandeep Kumar Yedla G Number:G01299433

Introduction

This semester we will be working with a dataset of all domestic outbound flights from Dulles International Airport in 2016.

Airports depend on accurate flight departure and arrival estimates to maintain operations, profitability, customer satisfaction, and compliance with state and federal laws. Flight performance, including departure and arrival delays, must be monitored, submitted to the Federal Aviation Administration (FAA) on a regular basis, and minimized to maintain airport operations. The FAA considers a flight to be delayed if it has an arrival delay of at least 15 minutes.

The executives at Dulles International Airport have hired you as a Data Science consultant to perform an exploratory data analysis on all domestic flights from 2016 and produce an executive summary of your key insights and recommendations to the executive team.

Before you begin, take a moment to read through the following airline flight terminology to familiarize yourself with the industry: Airline Flight Terms

Dulles Flights Data

The flights_df data frame is loaded below and consists of 33,433 flights from IAD (Dulles International) in 2016. Each row in this data frame represents a single flight with all of the associated features that are displayed in the table below.

Note: If you have not installed the tidyverse package, please do so by going to the Packages tab in the lower right section of RStudio, select the Install button and type tidyverse into the prompt. If you cannot load the data, then try downloading the latest version of R (at least 4.0). The readRDS() function has different behavior in older versions of R and may cause loading issues.

## importing required libraries

library(tidyverse)
library(skimr)
library(dplyr)
library(plotly)
library(ggplot2)
library(paletteer)
library(corrplot)
library(RColorBrewer)
# importing the flight data
flights_df <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/dulles_flights.rds'))

Raw Data

flights_df
skim(flights_df)
Data summary
Name flights_df
Number of rows 33433
Number of columns 22
_______________________
Column type frequency:
Date 1
factor 8
numeric 13
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
scheduled_flight_date 0 1 2016-01-01 2016-12-31 2016-07-12 364

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 FALSE 12 Jul: 3120, Aug: 3092, Jun: 3071, Oct: 3051
weekday 0 1 FALSE 7 Fri: 5015, Thu: 4993, Wed: 4973, Tue: 4917
airline 0 1 FALSE 10 Uni: 20653, Ame: 2597, Del: 2565, Sou: 2161
tail_num 0 1 FALSE 2463 N63: 125, N69: 113, N66: 112, N66: 109
dest_airport_name 0 1 FALSE 40 San: 4034, Los: 3846, Den: 3628, Har: 3154
dest_airport_city 0 1 FALSE 36 San: 4034, Los: 3846, Den: 3628, Atl: 3154
dest_airport_state 0 1 FALSE 26 Cal: 9177, Col: 3657, Flo: 3511, Geo: 3154
dest_airport_region 0 1 FALSE 6 Wes: 15555, Sou: 7752, Nor: 4002, Mid: 3040

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
month_numeric 0 1 6.76 3.33 1.00 4.00 7.00 10.00 12.00 ▆▅▆▆▇
day 0 1 15.72 8.78 1.00 8.00 16.00 23.00 31.00 ▇▇▇▆▆
flight_num 0 1 1049.63 1022.03 10.00 365.00 685.00 1544.00 6840.00 ▇▂▁▁▁
sch_dep_time 0 1 14.00 4.96 5.33 8.83 14.90 17.75 22.95 ▆▃▃▇▃
dep_time 0 1 14.08 5.07 0.02 8.87 14.92 17.98 24.00 ▁▆▃▇▃
dep_delay 0 1 9.07 41.69 -25.00 -5.00 -2.00 3.00 1244.00 ▇▁▁▁▁
taxi_out 0 1 16.95 10.80 1.00 11.00 14.00 18.00 159.00 ▇▁▁▁▁
wheels_on 0 1 14.87 5.72 0.02 10.53 15.08 19.85 24.00 ▁▃▇▆▇
taxi_in 0 1 8.87 6.68 1.00 5.00 7.00 10.00 178.00 ▇▁▁▁▁
arrival_time 0 1 14.89 5.79 0.02 10.60 15.12 19.97 24.00 ▂▃▇▆▇
sch_arrival_time 0 1 14.94 5.71 0.02 10.77 15.28 20.00 23.98 ▁▃▇▆▇
arrival_delay 0 1 -0.55 45.47 -94.00 -20.00 -11.00 2.00 1228.00 ▇▁▁▁▁
distance 0 1 1354.55 839.61 157.00 534.00 1190.00 2288.00 4817.00 ▇▃▆▁▁
#View(flights_df)

Exploratory Data Analysis

Executives at this company have hired you as a data science consultant to evaluate their flight data and make recommendations on flight operations and strategies for minimizing flight delays.

You must think of at least 8 relevant questions that will provide evidence for your recommendations.

The goal of your analysis should be discovering which variables drive the differences between flights that are early/on-time vs. flights that are delayed.

Some of the many questions you can explore include:

You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr/tidyr) or a plot (using ggplot) or both.

In total, you must have a minimum of 5 plots and 4 summary data frames for the exploratory data analysis section. Among the plots you produce, you must have at least 4 different types (ex. box plot, bar chart, histogram, heat map, etc…)

Each question must be answered with supporting evidence from your tables and plots.

## Subset delayed flights: an arrival delay of at least 15 minutes counts as delayed
delayed_data <- flights_df %>%
  filter(arrival_delay >= 15)
delayed_data
## Subset on-time flights (arrival delay under 15 minutes) for further use
ontime_data <- flights_df %>%
  filter(arrival_delay < 15)
ontime_data
#View(delayed_data)
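
As an alternative to keeping two separate subsets, the 15-minute threshold can be stored as a flag column, which makes it easy to compute the share of delayed flights. This is a minimal sketch on invented toy data; `toy_flights`, `toy_flagged`, and `delay_rate` are hypothetical names that do not appear in the analysis.

```r
library(dplyr)

# Toy stand-in for flights_df; the values are invented for illustration.
toy_flights <- tibble(
  airline       = c("United", "United", "Delta", "Delta", "American"),
  arrival_delay = c(-10, 15, 42, 3, 14)
)

# Flag each flight using the 15-minute arrival delay threshold.
toy_flagged <- toy_flights %>%
  mutate(is_delayed = arrival_delay >= 15)

# The mean of a logical column is the share of TRUE values.
delay_rate <- mean(toy_flagged$is_delayed)
delay_rate
```

With a flag column, the same data frame can feed both counts and rates in later summaries.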

plot 1

Question 1

Do the months (seasonal change or weather) affect flight arrival at the destination?

Answer: Yes. The number of delayed flights differs by month. July has about 748 delayed flights, June 682, and August 550. These are summer months, so more flights are probably operating and the added air traffic may cause delays. December also has about 670 delayed flights, probably because winter weather such as snow causes delays. If weather data were available, this could be analyzed in more detail. The pie chart shows each month's share of all delayed flights.

Suggested Recommendation: Flight schedules should be revised based on the departure times shown in the donut plots below, so that flight delays are minimized.

weather<-select(delayed_data,month_numeric, airline, arrival_delay, distance,month)
weather
month_count<-weather %>% count(month, name = 'Dealys_in_each_month',sort = TRUE)
month_count

To add additional R code chunks for your work, select Insert then R from the top of this notebook file.

# Pie chart of each month's share of delayed flights
fig <- plot_ly(month_count, labels = ~month, values = ~Dealys_in_each_month, type = 'pie')
fig <- fig %>% layout(title = 'Percentage of delays by month',
                      xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
                      yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig
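
Counts of delayed flights partly reflect how many flights operate in each month, so a delay rate per month may be a fairer comparison. The sketch below uses invented toy data, and `toy_flights` and `monthly_rate` are hypothetical names; on the real data the same pipeline would start from `flights_df`.

```r
library(dplyr)

# Toy data: both months have 2 delayed flights, but different totals.
toy_flights <- tibble(
  month         = c(rep("Jun", 4), rep("Dec", 3)),
  arrival_delay = c(20, -5, 0, 30, 45, 20, -2)
)

# Per month: total flights, delayed flights, and share of flights delayed.
monthly_rate <- toy_flights %>%
  group_by(month) %>%
  summarise(
    flights    = n(),
    delayed    = sum(arrival_delay >= 15),
    delay_rate = mean(arrival_delay >= 15),
    .groups    = "drop"
  ) %>%
  arrange(desc(delay_rate))

monthly_rate
```

In this toy example the two months have the same delayed count but different rates, which is the distinction a count-only pie chart cannot show.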

plot 2

Question 2

Question

Does the departure hour affect flight arrival at the destination because of air traffic?

Answer:

Yes. Most delayed flights depart between about 16:30 (4:30 p.m.) and 21:00 (9:00 p.m.), which is also when most flights depart. It is possible that sunset or reduced light affects arrival times.

Suggested Recommendation: Based on the analysis below, the departure schedule could be changed so that some flights depart between 0:00 (12:00 a.m.) and 5:00 a.m. Spreading departures more evenly around the clock may reduce arrival delays.

# Histogram and density of departure times (hour of day) for delayed flights
ggplot(delayed_data, aes(x = dep_time)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50,
                 color = "gray", fill = "white") +
  geom_density(fill = "black", alpha = 0.2) +
  xlim(0, 24) +
  labs(title = "Departure Time Distribution of Delayed Flights",
       x = "Departure time (hour of day)", y = "Density") +
  theme(plot.title = element_text(hjust = 0.52))
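
Departure hours can also be grouped into time-of-day bands to see each band's share of delayed flights. The sketch below uses `cut()` on invented toy data and assumes, as the skim output suggests, that `dep_time` is stored as decimal hours; `toy_delayed`, `time_of_day`, and the band labels are hypothetical.

```r
library(dplyr)

# Toy departure times (decimal hours) for delayed flights; values are invented.
toy_delayed <- tibble(dep_time = c(5.5, 9.0, 13.2, 16.8, 17.5, 19.9, 21.3, 12.0))

time_of_day <- toy_delayed %>%
  mutate(band = cut(dep_time,
                    breaks = c(0, 6, 12, 18, 24),
                    labels = c("Early hours", "Morning", "Afternoon", "Evening"),
                    include.lowest = TRUE)) %>%  # bands are (a, b]; 0 itself is kept
  count(band, name = "delayed_flights") %>%
  # No grouping is active here, so sum() is the overall total.
  mutate(percent_of_total = round(100 * delayed_flights / sum(delayed_flights), 1))

time_of_day
```

Using `cut()` keeps the band boundaries in one place and returns an ordered factor, so the bands appear in clock order rather than alphabetically.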

plot 3, summary 1

Question 3

Question: Do specific destination regions affect flight arrival delays?

Answer:

Yes. The West and South regions have more delayed flights than the Northeast, Midwest, Southwest, and Middle Atlantic regions. Of the 5258 delayed flights, 1752 are in the West region and 806 are in the South region.

avg_delay <- mean(flights_df$dep_delay)
avg_delay
## [1] 9.070529
# Delayed flights whose departure delay is at least the overall average, by region
region_Cond <- delayed_data %>%
  filter(dep_delay >= avg_delay) %>%
  group_by(dest_airport_region) %>%
  summarise(number_of_delays_in_region = n(),
            Avg_arrival_delay_in_region = mean(arrival_delay))
region_Cond
# Bar chart with uniform color and a count label on each bar
ggplot(region_Cond, aes(x = dest_airport_region, y = number_of_delays_in_region)) +
  geom_col(position = "dodge") +
  labs(title = "Flight delays by region",
       x = "Region",
       y = "Number of flights delayed") +
  geom_text(
    aes(label = number_of_delays_in_region),
    colour = "red", size = 4,
    vjust = 0.001, position = position_dodge(.95))


plot 4

Question 4

Question

During which hours are most flights delayed?

Answer:

Most delayed flights depart in the afternoon and evening, approximately between 3 p.m. and 9 p.m. The donut chart below shows that the remaining hours account for comparatively fewer delays.

```r
# Count delayed flights by time-of-day band and compute each band's share
flighttypes_df <- delayed_data %>%
  select(dep_time) %>%
  mutate(flighttype = case_when(
    dep_time <= 6  ~ "Early_hours",
    dep_time <= 12 ~ "LateMorning",
    dep_time <= 18 ~ "Afternoon",
    TRUE           ~ "Evening")) %>%
  group_by(flighttype) %>%
  summarise(n = n(), .groups = "drop") %>%
  # Computed after ungrouping so sum(n) is the overall total, not the group total
  mutate(percent_of_total = round(n * 100 / sum(n), 1)) %>%
  data.frame()

# Donut chart of delayed flights by time of day
plot_ly(flighttypes_df, labels = ~flighttype, values = ~n) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Total Delays of Flights by Time of Day",
         annotations = list(text = paste0("Total Flight Delay Count: \n",
                                          scales::comma(sum(flighttypes_df$n))),
                            showarrow = FALSE))
```

plot 5

Question 5

Question: Does the distance travelled by a flight affect arrival delays?

Answer: Yes, to a small extent. Among the delayed flights, United flies the longest distances, and United, American, and Southwest have the widest distance distributions. This suggests that the distance flown has a small effect on flight delays.

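distance_data <- select(delayed_data, distance, airline, arrival_delay)
##distance_data
airline_count <- distance_data %>% count(airline, name = 'Delayed_flight_count_of_airline', sort = TRUE)
airline_count
United_airlines <- filter(distance_data, airline == 'United')
United_airlines
# Violin plot of distance by airline, ordered by median distance.
# reorder() expects FUN in upper case; a lower-case fun is not matched,
# so the default (mean) would be used instead.
# ylim() drops distances above 2500 before the violins are drawn.
ggplot(data = distance_data, mapping = aes(x = reorder(airline, distance, FUN = median),
                                           y = distance, fill = airline)) +
  geom_violin() +
  geom_jitter(width = 0.08, alpha = 0.6) +
  ylim(0, 2500) +
  labs(title = "Distance Flown by Delayed Flights, by Airline",
       x = "Airline", y = "Distance")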
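
A rank correlation is one simple way to check how strongly distance and arrival delay move together. The sketch below uses base R on invented toy data, with `toy_flights` and `rho` as hypothetical names. Spearman's method is chosen here as an assumption, because it is less sensitive to a few extreme delays than the Pearson default.

```r
# Toy data: distance flown and arrival delay in minutes; values are invented.
toy_flights <- data.frame(
  distance      = c(200, 450, 900, 1500, 2300, 2400),
  arrival_delay = c(5, -3, 12, 40, 8, 25)
)

# Spearman rank correlation: +1 means delays always rise with distance,
# 0 means no monotonic relationship, -1 means delays always fall.
rho <- cor(toy_flights$distance, toy_flights$arrival_delay, method = "spearman")
rho
```

On the real data, the same call on `flights_df$distance` and `flights_df$arrival_delay` would give a single number to report alongside the violin plot.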

summary 2

Question 6

Question:

Are delayed flights concentrated in specific airlines, and which factors affect those airlines most?

Answer: The summary below shows that delayed flights are concentrated in a few airlines: United has 3115 delayed flights, American 538, and Delta 330. Among these delayed flights, the average departure delay is about 74.9 minutes for Delta, 66.6 for American, and 64.0 for United. The average arrival delays follow the same pattern, at 79.7, 74.1, and 69.6 minutes respectively. These airlines also have higher average wheels-on times.

# Per-airline summary of delayed flights
delayed_data_summary <- delayed_data %>%
  group_by(airline) %>%
  summarise(Number_of_delayed_flights = n(),
            Avg_of_departure_delays = mean(dep_delay),
            Avg_of_arrival_delays = mean(arrival_delay),
            Avg_wheels_on = mean(wheels_on))

# Per-airline summary of on-time flights, for comparison
ontime_data_summary <- ontime_data %>%
  group_by(airline) %>%
  summarise(Number_of_ontime_flights = n(),
            Avg_of_departure_delays = mean(dep_delay),
            Avg_of_arrival_delays = mean(arrival_delay),
            Avg_wheels_on = mean(wheels_on))
delayed_data_summary
ontime_data_summary
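
The skim output shows that dep_delay is heavily right-skewed (mean 9.07, median -2, maximum 1244), so a few extreme flights can dominate an airline's average. Reporting the median next to the mean is one way to guard against that. The sketch below uses invented toy data; `toy_delayed`, `robust_summary`, and the airline labels "A" and "B" are hypothetical.

```r
library(dplyr)

# Toy data: airline A has one extreme delay, airline B is consistently late.
toy_delayed <- tibble(
  airline   = c("A", "A", "A", "B", "B", "B"),
  dep_delay = c(20, 25, 600, 40, 45, 50)
)

# Mean and median departure delay per airline.
robust_summary <- toy_delayed %>%
  group_by(airline) %>%
  summarise(
    mean_dep_delay   = mean(dep_delay),
    median_dep_delay = median(dep_delay),
    .groups          = "drop"
  )

robust_summary
```

In this toy example airline A has the higher mean but the lower median, so the two statistics rank the airlines differently.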

summary 3

Question 7

Question: Do airport operations such as taxi-out time, taxi-in time, and departure delays affect arrival delays?

Answer: Yes. Flights to Fort Lauderdale-Hollywood airport have the highest average departure delay, at around 420 minutes, followed by Daniel K. Inouye airport at 112 minutes. The taxi-out times for these airports are 18 minutes and 16 minutes, respectively.

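# Per-destination summary of delayed flights, sorted by average departure delay
delayed_data %>%
  group_by(dest_airport_name) %>%
  summarise(Avg_Departure_delays = mean(dep_delay),
            Avg_wheels_on = mean(wheels_on),  # mean, to match the column name
            Min_taxi_out_time = min(taxi_out),
            Min_taxi_in = min(taxi_in)) %>%
  arrange(desc(Avg_Departure_delays))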

summary 4

Question 8

Question: Do specific weekdays affect arrival delays at the destination?

Answer:

Delayed flights are most common on Thursday, Monday, and Friday: about 893 flights are delayed on Thursday, 811 on Monday, and 803 on Friday. From the summary, Monday has an average departure delay of 63.6 minutes and an average taxi-out time of 27.9 minutes; a second average delay of 68.1 minutes is also reported for Monday.

delayed_week_summary <- delayed_data %>%
  group_by(weekday) %>%
  summarise(No_of_delayed_flights_days = n(),
            Avg_of_departure_delays = mean(dep_delay),
            Avg_of_taxi_in = mean(taxi_in),
            Avg_taxi_out = mean(taxi_out)) %>%
  arrange(desc(No_of_delayed_flights_days))

delayed_week_summary

plot 6

week_data <- select(delayed_data, weekday, airline, arrival_delay)

weekday_count <- week_data %>% count(weekday, name = 'Delayed_flight_count', sort = TRUE)
weekday_count
# Arrival delay distribution of delayed flights, one panel per weekday
ggplot(data = week_data, mapping = aes(x = arrival_delay, fill = weekday)) +
  geom_histogram(color = "white", bins = 10) +
  facet_wrap(~ weekday, nrow = 1) +
  xlim(0, 500) +
  labs(title = "Distribution of Arrival Delays by Weekday (Delayed Flights)",
       x = "Arrival delay (minutes)",
       y = "Number of flights")

# Arrival delay distribution of on-time flights, one panel per weekday
ggplot(data = ontime_data, mapping = aes(x = arrival_delay, fill = weekday)) +
  geom_histogram(color = "white", bins = 15) +
  facet_wrap(~ weekday, nrow = 1) +
  labs(title = "Distribution of Arrival Delays by Weekday (On-Time Flights)",
       x = "Arrival delay (minutes)",
       y = "Number of flights")

Summary of Results

Write an executive summary of your overall findings and recommendations to the executives at Dulles Airport. Think of this section as your closing remarks of a presentation, where you summarize your key findings and make recommendations on flight operations and strategies for minimizing flight delays.

Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:

  1. An introduction where you explain the business problem and goals of your data analysis

    • What problem(s) is this company trying to solve? Why are they important to their future success?

    • What was the goal of your analysis? What questions were you trying to answer and why do they matter?

  2. Highlights and key findings from your Exploratory Data Analysis section

    • What were the interesting findings from your analysis and why are they important for the business?

    • This section is meant to establish the need for your recommendations in the following section

  3. Your recommendations to the company

    • Each recommendation must be supported by your data analysis results

    • You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation

    • You must also describe the potential business impact of your recommendation:

      • Why is this a good recommendation?

      • What benefits will the business achieve?

Executive Summary

Please write your executive summary below. If you prefer, you can type your summary in a text editor, such as Microsoft Word, and paste your final text here.

Introduction:

Flight delays are a severe issue that costs airlines, passengers, and the United States’ economy. A greater knowledge of how weather affects aircraft can aid in the development of forecasts and the reduction of the risk of flight delays.

Key features and findings:

Delayed flights are concentrated among the largest carriers. United accounts for 3115 of the delayed flights, followed by American with 538 and Delta with 330, while Delta's delayed flights have the longest average departure delay at about 74.9 minutes. By weekday, Thursday (893), Monday (811), and Friday (803) have the most delayed flights. By time of day, most delayed flights depart in the afternoon and evening, roughly between 3 p.m. and 9 p.m., and comparatively few depart in the early morning. By month, the delayed counts are highest in July (748), June (682), and December (670). By region, the West and South account for the most delayed flights.

Each airline's flights should be spread more evenly throughout the day.

Recommendations to the company:

Airlines with many delayed flights, such as United, should increase their ground crew and air traffic staff. A larger, well-trained team is better placed to manage boarding, check flight status quickly, and clear flights for departure, which should reduce taxi-in times, taxi-out times, and departure delays. Airlines such as Frontier, SkyWest, and Jet must improve their operations so that their flights can be handled more quickly and more frequently. Flight schedules should be revised based on an analysis of departure times so that delays are minimized. In particular, some flights could be scheduled between 0:00 (12:00 a.m.) and 5:00 a.m. so that departures are spread more evenly around the clock, which may reduce arrival delays. Flights to the West and South regions should focus on reducing departure delays. Airlines should also plan ahead for December, with close communication between flight and ground crews to reduce air traffic congestion during that period.

If these recommendations are implemented, the airlines should see fewer delays and higher profits, and this analysis can help grow their business.